GitHub Repository: debakarr/machinelearning
Path: blob/master/Part 7 - Natural Language Processing/[Python] Natural Language Processing.ipynb
Kernel: Python 3

Natural Language Processing

Data Preprocessing

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import re  # for regex
# import nltk  # The Natural Language Toolkit library
# nltk.download('stopwords')  # download the stopwords corpus once
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer  # for tokenization

%matplotlib inline
plt.rcParams['figure.figsize'] = [14, 8]
dataset = pd.read_table('Restaurant_Reviews.tsv')  # tsv stands for tab-separated values
dataset.head(10)
len(dataset)
1000
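Side note: some copies of this file contain double quotes inside the reviews, which can confuse the parser. If that happens, a common workaround (an alternative to the read_table call above, not what this notebook ran) is to disable quote handling entirely:

# quoting = 3 means csv.QUOTE_NONE, i.e. treat quote characters as ordinary text
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)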

Cleaning the texts

corpus = []
for i in range(0, 1000):
    # Keep only alphabetic characters and replace any other character with a whitespace
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    # Change everything to lowercase
    review = review.lower()
    # Remove the non-significant words, e.g. 'the', 'a', 'an', 'in', 'on',
    # i.e. the articles and prepositions, then apply stemming
    review = review.split()
    ps = PorterStemmer()
    review = [ps.stem(word) for word in review if word not in stopwords.words('english')]
    # Change the list back to a sentence
    review = ' '.join(review)
    # Append the newly generated sentence to the corpus
    corpus.append(review)
dataset.head(10)
corpus[0:10]
['wow love place',
 'crust good',
 'tasti textur nasti',
 'stop late may bank holiday rick steve recommend love',
 'select menu great price',
 'get angri want damn pho',
 'honeslti tast fresh',
 'potato like rubber could tell made ahead time kept warmer',
 'fri great',
 'great touch']
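To make the cleaning steps concrete, here is the first review walked through the same pipeline one operation at a time (an illustrative sketch using the exact operations of the loop above):

sample = dataset['Review'][0]             # 'Wow... Loved this place.'
step1 = re.sub('[^a-zA-Z]', ' ', sample)  # punctuation replaced by spaces
step2 = step1.lower().split()             # ['wow', 'loved', 'this', 'place']
ps = PorterStemmer()
step3 = [ps.stem(w) for w in step2 if w not in stopwords.words('english')]
print(' '.join(step3))                    # 'wow love place' -- 'this' dropped, 'loved' stemmed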

Creating the Bag of Words model

cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
X.shape
(1000, 1500)
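Each of the 1500 columns of X counts one word of the vocabulary that CountVectorizer kept (the 1500 most frequent tokens in the corpus). To peek at a few of them (note: on scikit-learn versions before 1.0 the method is called get_feature_names instead):

# First ten words of the learned vocabulary, in alphabetical order
print(cv.get_feature_names_out()[:10])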
y = dataset.iloc[:, 1].values
y[0:10]
array([1, 0, 0, 1, 1, 0, 0, 0, 1, 1])

Splitting the dataset into the Training set and Test set

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)
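This dataset is balanced (500 positive and 500 negative reviews), so a plain random split is usually fine; if you want to guarantee the same class balance in both splits, train_test_split also accepts a stratify argument:

# Optional variant: stratify = y preserves the class proportions in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.25, random_state = 42, stratify = y)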

Fitting Naive Bayes to the Training set

from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
GaussianNB(priors=None)
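GaussianNB assumes each feature is a normally distributed continuous value, which is a rough fit for sparse word counts. MultinomialNB models the counts directly and is the usual textbook choice for Bag of Words features, so it is worth comparing on the same split (a variant to try, not part of the original notebook):

# MultinomialNB works on the raw count features and often beats GaussianNB here
from sklearn.naive_bayes import MultinomialNB
classifier_mnb = MultinomialNB()
classifier_mnb.fit(X_train, y_train)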

Predicting the Test set results

y_pred = classifier.predict(X_test)

Making the Confusion Matrix

from sklearn.metrics import confusion_matrix
cm_nb = confusion_matrix(y_test, y_pred)
cm_nb
array([[ 66,  62],
       [ 18, 104]])
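In scikit-learn's convention the rows of the confusion matrix are the actual classes and the columns the predicted ones, so this run gives 66 true negatives, 62 false positives, 18 false negatives and 104 true positives. The per-class metrics computed by hand in the Homework section below can also be cross-checked with the built-in summary:

# Built-in per-class Precision, Recall and F1 summary
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))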

Homework

1. Run the other classification models we made in Part 3 - Classification, other than the one we used in the last tutorial.

Decision Tree

# Fitting Decision Tree Classification to the Training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm_dt = confusion_matrix(y_test, y_pred)
cm_dt
array([[94, 34],
       [50, 72]])

Random Forest Classification

# Fitting Random Forest Classification to the Training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm_rf = confusion_matrix(y_test, y_pred)
cm_rf
array([[113,  15],
       [ 56,  66]])
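A tuning note: n_estimators = 10 is quite small (newer scikit-learn versions default to 100), and random forests generally become more stable with more trees. A variant worth trying; results will differ from the run above:

# More trees usually reduces variance across runs
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 100, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)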

2. Evaluate the performance of each of these models. Try to beat the Accuracy obtained in the tutorial. But remember, Accuracy is not enough, so you should also look at other performance metrics such as Precision (measuring exactness), Recall (measuring completeness) and the F1 Score (a compromise between Precision and Recall). The formulas for these metrics are given below (TP = # True Positives, TN = # True Negatives, FP = # False Positives, FN = # False Negatives):

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1 Score = 2 * Precision * Recall / (Precision + Recall)
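The three cells below apply these formulas to each confusion matrix by hand. The same arithmetic can also be wrapped once in a small helper function (a hypothetical convenience, not part of the original notebook):

def report(cm, name):
    # Unpack the 2x2 confusion matrix: rows are actual, columns are predicted
    tn, fp = cm[0]
    fn, tp = cm[1]
    accuracy = (tp + tn) / np.sum(cm)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    print(name, '-> Accuracy:', accuracy, 'Precision:', precision,
          'Recall:', recall, 'F1 Score:', f1)

# Usage: report(cm_nb, 'Naive Bayes')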

Accuracy, Precision, Recall, F1 Score of Naive Bayes

A = (cm_nb[0][0] + cm_nb[1][1]) / np.sum(cm_nb)
P = cm_nb[1][1] / (cm_nb[1][1] + cm_nb[0][1])
R = cm_nb[1][1] / (cm_nb[1][1] + cm_nb[1][0])
print('Accuracy of Naive Bayes:', A)
print('Precision of Naive Bayes:', P)
print('Recall of Naive Bayes:', R)
print('F1 Score of Naive Bayes:', 2 * P * R / (P + R))
Accuracy of Naive Bayes: 0.68
Precision of Naive Bayes: 0.626506024096
Recall of Naive Bayes: 0.852459016393
F1 Score of Naive Bayes: 0.722222222222

Accuracy, Precision, Recall, F1 Score of Decision Tree

A = (cm_dt[0][0] + cm_dt[1][1]) / np.sum(cm_dt)
P = cm_dt[1][1] / (cm_dt[1][1] + cm_dt[0][1])
R = cm_dt[1][1] / (cm_dt[1][1] + cm_dt[1][0])
print('Accuracy of Decision Tree:', A)
print('Precision of Decision Tree:', P)
print('Recall of Decision Tree:', R)
print('F1 Score of Decision Tree:', 2 * P * R / (P + R))
Accuracy of Decision Tree: 0.664
Precision of Decision Tree: 0.679245283019
Recall of Decision Tree: 0.590163934426
F1 Score of Decision Tree: 0.631578947368

Accuracy, Precision, Recall, F1 Score of Random Forest

A = (cm_rf[0][0] + cm_rf[1][1]) / np.sum(cm_rf)
P = cm_rf[1][1] / (cm_rf[1][1] + cm_rf[0][1])
R = cm_rf[1][1] / (cm_rf[1][1] + cm_rf[1][0])
print('Accuracy of Random Forest:', A)
print('Precision of Random Forest:', P)
print('Recall of Random Forest:', R)
print('F1 Score of Random Forest:', 2 * P * R / (P + R))
Accuracy of Random Forest: 0.716
Precision of Random Forest: 0.814814814815
Recall of Random Forest: 0.540983606557
F1 Score of Random Forest: 0.650246305419
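Taken together: Random Forest has the best Accuracy (0.716) and by far the best Precision (0.815), Naive Bayes has the best Recall (0.852) and the best F1 Score (0.722), and the Decision Tree does not lead on any metric. Which model "beats" the tutorial therefore depends on which metric matters for the application.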